In this era of digital payments, cashless transactions have grown rapidly and credit cards are widely used around the world. We are undeniably moving toward a cashless society, and traditional payment methods alone no longer suffice. At the same time, credit card fraud has become a major concern.
Credit card providers issue thousands of cards to their customers and must ensure that every cardholder is genuine. A mistake in issuing a card can lead to financial losses, and with the rapid growth of cashless transactions, the number of fraudulent transactions is also likely to increase. Credit card fraud detection has therefore become critically important.
A fraudulent transaction can be identified by analyzing the behavior of credit card customers in historical transaction datasets. If spending behavior deviates from the established patterns, the transaction is possibly fraudulent.
To detect credit card fraud we will use an algorithm called Isolation Forest. First, it's important to understand that among all the credit card transactions in our data there are some unusual records, which are the frauds. What we want to find is strange behavior in some of the data that deviates from the rest. In other words, a fraud is an anomaly that we want to find.
An example of an anomaly is shown below:

The dataset we'll use is a synthetic dataset for a mobile payments application. In this dataset, we'll have the sender and recipient of a transaction as well as whether transactions are tagged as fraud or not fraud.
For this project we used the dataset from Kaggle: Dataset
This data contains the following ten fields:
import pandas as pd
data = pd.read_csv('transactions_train.csv')
First, let's take a look at some general information about our data.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6351193 entries, 0 to 6351192
Data columns (total 10 columns):
 #   Column          Dtype
---  ------          -----
 0   step            int64
 1   type            object
 2   amount          float64
 3   nameOrig        object
 4   oldbalanceOrig  float64
 5   newbalanceOrig  float64
 6   nameDest        object
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64
dtypes: float64(5), int64(2), object(3)
memory usage: 484.6+ MB
Above we can see, for example, that there are no null values in our data.
We can also verify this with the following code:
data.isnull().sum()
step              0
type              0
amount            0
nameOrig          0
oldbalanceOrig    0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
dtype: int64
Now let's list all the columns of our data, as already described above:
data.columns
Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrig',
'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest',
'isFraud'],
dtype='object')
And here we can see how many rows (samples) and columns there are in our CSV data file.
data.shape
(6351193, 10)
Now, let's see if we have duplicates in our data:
data.duplicated().sum()
0
As the result above shows, there are no duplicates.
Now, let's take a look at the first 5 rows of our data, to get a better feel for it:
data.head()
|   | step | type | amount | nameOrig | oldbalanceOrig | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.0 | 0.00 | C553264065 | 0.0 | 0.0 | 1 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.0 | 0.00 | C38997010 | 21182.0 | 0.0 | 1 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 |
We can use pandas to see some summary statistics about the data. Let's take a look:
data.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| step | 6351193.0 | 2.425553e+02 | 1.410676e+02 | 1.0 | 155.00 | 238.00 | 334.00 | 6.990000e+02 |
| amount | 6351193.0 | 1.798155e+05 | 6.036310e+05 | 0.0 | 13388.29 | 74864.83 | 208715.19 | 9.244552e+07 |
| oldbalanceOrig | 6351193.0 | 8.347957e+05 | 2.889959e+06 | 0.0 | 0.00 | 14153.00 | 107346.00 | 5.958504e+07 |
| newbalanceOrig | 6351193.0 | 8.561696e+05 | 2.926073e+06 | 0.0 | 0.00 | 0.00 | 144365.15 | 4.958504e+07 |
| oldbalanceDest | 6351193.0 | 1.101043e+06 | 3.398924e+06 | 0.0 | 0.00 | 133086.55 | 943866.12 | 3.560159e+08 |
| newbalanceDest | 6351193.0 | 1.225372e+06 | 3.674293e+06 | 0.0 | 0.00 | 214919.01 | 1112791.08 | 3.561793e+08 |
| isFraud | 6351193.0 | 1.215047e-03 | 3.483635e-02 | 0.0 | 0.00 | 0.00 | 0.00 | 1.000000e+00 |
Now, we would like to talk about skewness.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Skewness is computed for each row/column in the DataFrame.
If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed.
If the skewness is less than -1 or greater than 1, the data are highly skewed.
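To make these thresholds concrete, here is a toy sketch (the log-normal sample and the log transform are illustrative assumptions; a log transform is one common way to correct heavy right skew):

```python
import numpy as np
import pandas as pd

# A right-skewed toy sample: many small values, a long tail of large ones
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

print(f"raw skew: {sample.skew():.2f}")          # well above 1 -> highly skewed
print(f"log skew: {np.log(sample).skew():.2f}")  # near 0 -> fairly symmetrical
# With zeros in the data (like our balance columns), np.log1p is the safer choice.
```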
data_for_skew = data.drop(columns=['type', 'nameOrig', 'nameDest'])  # drop the non-numeric fields
skew = data_for_skew.skew().sort_values(ascending=False)
skew_df = pd.DataFrame({'skew': skew})
skew_df.head(10)
|   | skew |
|---|---|
| amount | 31.050928 |
| isFraud | 28.635901 |
| oldbalanceDest | 19.934164 |
| newbalanceDest | 19.362310 |
| oldbalanceOrig | 5.243790 |
| newbalanceOrig | 5.172421 |
| step | 0.338249 |
Let's look at the columns that are highly skewed
skew_df[(skew_df['skew']>=1) |(skew_df['skew']<=-1) ].index
Index(['amount', 'isFraud', 'oldbalanceDest', 'newbalanceDest',
'oldbalanceOrig', 'newbalanceOrig'],
dtype='object')
We can see that all of the columns except 'step' are highly skewed. Therefore, our data in general is highly skewed, and we should remember to correct this before modeling.
Now, we'll import the relevant libraries for plotting graphs:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
Let's take a look at a plot showing the distribution of the 'step' variable:
fig = sns.displot(data=data['step'], kde=False, color="blue", height=9, aspect=1.4)
plt.show()
As the plot above shows, the normal behavior of transactions with respect to steps (1 step is 1 hour) falls between 0-50 and 150-400.
Transactions outside these ranges might be anomalies; however, we may simply not have enough data covering the less common areas.
Now let's confirm that the data we are using is highly unbalanced.
As a reminder, in our data 'isFraud' takes the value 1 in case of fraud and 0 otherwise.
plt.figure(figsize=(12,10))
fig = sns.countplot(x="isFraud", data=data)
plt.show()
Now that we have seen the data is highly unbalanced, let's look at the numbers. We want to know exactly how many of the samples are frauds and how many are regular transactions.
For this we will use pandas library.
fraud = data[data['isFraud'] == 1] # frauds
valid = data[data['isFraud'] == 0] # regular transactions
fraud_percentage = 100 * (len(fraud) / (len(data)))
outlier_fraction = len(fraud) / len(valid)
print(f'All Cases: {len(data)} samples')
print(f'Fraud Cases: {len(fraud)} samples')
print(f'Valid Cases: {len(valid)} samples')
print(f'Fraud Percentage: {round(fraud_percentage, 3)} %')
print(f'Outlier Fraction: {round(outlier_fraction, 3)}')
All Cases: 6351193 samples
Fraud Cases: 7717 samples
Valid Cases: 6343476 samples
Fraud Percentage: 0.122 %
Outlier Fraction: 0.001
From these results we can appreciate how few fraud samples we truly have relative to the whole dataset.
Now, we are going to plot some graphs using only 50,000 samples, because the full dataset is too large for these plots.
sns.pairplot(data.iloc[0:50000], hue= 'isFraud') # only 50,000 samples
<seaborn.axisgrid.PairGrid at 0x26170df6980>
The orange dots/lines in the plots above represent the frauds, and we can see that in some cases they are separated from the normal (blue) distribution pattern.
Now, let's take a look at the 'type' column of our data:
sns.countplot(x='type', data=data)
plt.xticks(rotation=45)
(array([0, 1, 2, 3, 4]), [Text(0, 0, 'PAYMENT'), Text(1, 0, 'TRANSFER'), Text(2, 0, 'CASH_OUT'), Text(3, 0, 'DEBIT'), Text(4, 0, 'CASH_IN')])
We can see that most of our data is of type PAYMENT, CASH_OUT, or CASH_IN. There is not much data of type TRANSFER or DEBIT.
Let's see exactly how much:
labels = data['type'].astype('category').cat.categories.tolist()
counts = data['type'].value_counts()
sizes = [counts[category] for category in labels]
type_fig, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False)  # autopct shows the percentage on the plot
ax1.axis('equal')
plt.show()
It is clear now that TRANSFER and DEBIT represent only 9.1% of the data.
Now, it will be interesting to check which types the frauds belong to:
fraud = data[data['isFraud'] == 1] # frauds
types = fraud['type'].unique()
print(types)
['TRANSFER' 'CASH_OUT']
As we can see, the frauds in our data occur only when the type is 'TRANSFER' or 'CASH_OUT'. This means that when modeling we might want to consider discarding all the other types, since we now know they contain no frauds, so we would not be able to verify whether our model can detect frauds of those types.
Let's look at the percentage of frauds of each type out of all the frauds:
types =['TRANSFER', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'CASH_IN']
for fraud_type in types:
fraud_type_percentage = 100 * (len(fraud[fraud['type'] == fraud_type]) / (len(fraud)))
print(f'Fraud {fraud_type}: {round(fraud_type_percentage, 3)} %')
Fraud TRANSFER: 49.877 %
Fraud CASH_OUT: 50.123 %
Fraud DEBIT: 0.0 %
Fraud PAYMENT: 0.0 %
Fraud CASH_IN: 0.0 %
As we can see, the frauds are split roughly evenly between 'TRANSFER' and 'CASH_OUT'.
Now, we would like to compute the pairwise correlation of the columns of our data.
We'll use the default correlation, which is Pearson's correlation. It is a linear correlation coefficient that returns a value between -1 and 1:
-1 means there is a strong negative correlation,
+1 means there is a strong positive correlation,
and 0 means there is no correlation.
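As a small illustration with toy series (not our dataset):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1   # perfectly linear in x
z = -x          # perfectly anti-linear in x

# Pearson's r = cov(x, y) / (std(x) * std(y)), computed by hand:
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

print(r_manual)   # 1.0
print(x.corr(y))  # ~1.0  (pandas' default method is Pearson)
print(x.corr(z))  # ~-1.0
```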
You can also see the color bar that maps the numbers in the correlation matrix figure.
correlation_matrix = data.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(correlation_matrix, cmap='coolwarm')
plt.show()
From the matrix above we can see that many of the values are close to 0 (the blue tones), which means most columns are fairly unrelated.
We can also see that along the main diagonal the values are close to 1 (the red tones), which signifies a stronger correlation.
What's interesting is that there is a medium correlation between 'amount' and newbalanceDest/newbalanceOrig.
Now let's zoom in on the correlation between the columns and the 'isFraud' column:
data.corr()['isFraud'][:-1].plot.barh(figsize=(8,6),alpha=.6,color='darkblue')
plt.xlim(-.075,.075);
plt.xticks([-0.065, -0.05 , -0.025, 0. , 0.025, 0.05 , 0.065], [str(100*i)+'%' for i in [-0.065, -0.05 , -0.025, 0. , 0.025, 0.05 , 0.065]],fontsize=12)
plt.title('Correlation between "isFraud" and numerical variables',fontsize=14);
plt.grid()
It's clear that there is a relation between the 'amount' and 'step' variables and 'isFraud'.
Now, we will look at the histogram of each column in our data.
Just a reminder that a histogram is a representation of the distribution of data.
data.hist(figsize=(15, 10), color="green")
plt.show()
From the histograms above, we first see that most columns in our data have a limited range of values, sometimes with a few outliers (for example 'amount' and 'oldbalanceOrig').
Second, as we already saw in an earlier figure, the 'isFraud' histogram again shows that our data is highly unbalanced (most samples are not fraud and only a few are). The frauds (values of one) are so rare they are barely visible in the plot.
Now we would like to see whether we can differentiate the fraud transactions from the valid transactions by plotting them for each numeric column.
# get the relevant columns
relevant_columns = [item for item in list(data) if item not in ['step', 'isFraud', 'type', 'nameOrig', 'nameDest']]
fraud = data[data['isFraud'] == 1]
for column in relevant_columns:
plt.figure(figsize=(15,7))
plt.scatter(x=data['step'], y=data[column], color="blue")
plt.scatter(x=fraud['step'], y=fraud[column], color='c')
plt.title(f"Frauds marked in figure - '{column}' depending on the variable 'step'")
plt.xlabel('step')
plt.ylabel(column)
plt.show()
In the figures above we can see that in some of the plots many of the frauds fall along a line in the same area.
For example, the figures for 'newbalanceDest' and 'oldbalanceDest'. In addition, most of the fraud values tend to be in the lower part of the value range.
As we have said and seen so far in this notebook, our data is highly imbalanced and skewed. Therefore, before starting to model, we first want to fix this.
We want to use PCA to reduce the dimensionality of our data so that we can plot it again and hopefully understand it better.
PCA is affected by scale, so we first need to scale the features in our data before applying PCA.
For this we'll use StandardScaler to standardize the dataset's features onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many later algorithms.
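As a quick illustration of what StandardScaler does, with made-up numbers on very different scales (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, like a count next to a balance column
X = np.array([[1.0, 100_000.0],
              [2.0, 250_000.0],
              [3.0, 400_000.0],
              [4.0, 550_000.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean 0 and unit variance
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```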
from sklearn.preprocessing import StandardScaler
We want to get rid of all the non-numeric columns, plus 'step':
numeric_no_step_data = data.drop(['type', 'nameOrig', 'nameDest', 'step'], axis=1)
Then, we'll standardize the dataset’s features onto unit scale
features = numeric_no_step_data.columns
x = numeric_no_step_data.loc[:, features].values # Separating out the features
y = numeric_no_step_data.loc[:,['isFraud']].values # Separating out the isFraud
# Standardizing the features
x = StandardScaler().fit_transform(x)
Now let's do the PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=4, svd_solver='full')
principalComponents = pca.fit_transform(x)
Now we can see our data after the PCA
principalDf = pd.DataFrame(data = principalComponents, columns = ['v1', 'v2', 'v3', 'v4'])
principalDf.head()
|   | v1 | v2 | v3 | v4 |
|---|---|---|---|---|
| 0 | -0.621581 | 0.111652 | -0.045378 | -0.071159 |
| 1 | -0.651714 | 0.174931 | -0.049722 | -0.085654 |
| 2 | -0.103892 | 0.535938 | 27.560581 | -7.902563 |
| 3 | -0.100178 | 0.537205 | 27.559867 | -7.904859 |
| 4 | -0.643067 | 0.170917 | -0.045711 | -0.071204 |
We would like to add the 'isFraud' and 'step' columns back to our data. Let's do it:
finalDf = pd.concat([principalDf, data[['isFraud', 'step']]], axis = 1)
finalDf.head()
|   | v1 | v2 | v3 | v4 | isFraud | step |
|---|---|---|---|---|---|---|
| 0 | -0.621581 | 0.111652 | -0.045378 | -0.071159 | 0 | 1 |
| 1 | -0.651714 | 0.174931 | -0.049722 | -0.085654 | 0 | 1 |
| 2 | -0.103892 | 0.535938 | 27.560581 | -7.902563 | 1 | 1 |
| 3 | -0.100178 | 0.537205 | 27.559867 | -7.904859 | 1 | 1 |
| 4 | -0.643067 | 0.170917 | -0.045711 | -0.071204 | 0 | 1 |
Now let's redraw some of the plots from before:
finalDf.hist(figsize=(10, 20), color="green")
plt.show()
Here we didn't gain much, except that the plots are now clearer and easier to read in the less common data ranges.
fraud = finalDf[finalDf['isFraud'] == 1]
plt.figure(figsize=(15,7))
plt.scatter(x=finalDf['step'], y=finalDf['v3'], color="blue")
plt.scatter(x=fraud['step'], y=fraud['v3'], color='c')
plt.title(f"Frauds marked in figure - '{'v3'}' depending on the variable 'step'")
plt.xlabel('step')
plt.ylabel('v3')
plt.show()
In the plot above we can see that we are now able to separate the fraud pattern from the valid-transaction pattern nicely, compared to the same plot before the PCA. The other plots of this kind didn't bring any new information.
sns.pairplot(finalDf.iloc[0:50000], hue= 'isFraud') # only 50,000 samples
<seaborn.axisgrid.PairGrid at 0x261c28cf700>
Here too, we can see that many of the plots above now separate the fraud pattern from the valid-transaction pattern nicely, compared to the same plots before the PCA.
For example, 'v4'.
OK, now we would like to see if we can use PCA again, this time to reduce the dimensionality of our data to even fewer than 4 dimensions. Let's try reducing it to 3 and see whether our plot results improve.
pca2 = PCA(n_components=3, svd_solver='full')
principalComponents2 = pca2.fit_transform(x)
principalDf2 = pd.DataFrame(data = principalComponents2, columns = ['v1', 'v2', 'v3'])
finalDf2 = pd.concat([principalDf2, data[['isFraud', 'step']]], axis = 1)
finalDf2.head()
|   | v1 | v2 | v3 | isFraud | step |
|---|---|---|---|---|---|
| 0 | -0.621581 | 0.111652 | -0.045378 | 0 | 1 |
| 1 | -0.651714 | 0.174931 | -0.049722 | 0 | 1 |
| 2 | -0.103892 | 0.535938 | 27.560581 | 1 | 1 |
| 3 | -0.100178 | 0.537205 | 27.559867 | 1 | 1 |
| 4 | -0.643067 | 0.170917 | -0.045711 | 0 | 1 |
Again, let's see if we now get better results for the plots:
# get the relevant columns - every column except 'step' and 'isFraud'
relevant_columns = [item for item in list(finalDf2) if item not in ['step', 'isFraud']]
fraud = finalDf2[finalDf2['isFraud'] == 1]
for column in relevant_columns:
plt.figure(figsize=(15,7))
plt.scatter(x=finalDf2['step'], y=finalDf2[column], color="blue")
plt.scatter(x=fraud['step'], y=fraud[column], color='c')
plt.title(f"Frauds marked in figure - '{column}' depending on the variable 'step'")
plt.xlabel('step')
plt.ylabel(column)
plt.show()
sns.pairplot(finalDf2.iloc[0:50000], hue= 'isFraud') # only 50,000 samples
<seaborn.axisgrid.PairGrid at 0x261b97c02b0>
In all the plots above we get similar information to the 4-dimensional PCA, so we decided to keep working with 3 dimensions.
Note that we also checked 2 dimensions, but we lost the information in the 'v3' plot above, which gives a nice separation between the fraud and valid distributions.
Now, we want to use Logistic Regression to model our data so it can detect which samples are frauds.
Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
Since the outcome is a probability, the dependent variable is bounded between 0 and 1.
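That bound comes from the logistic (sigmoid) function, which squashes the model's linear score into (0, 1); a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression computes a linear score w.x + b and passes it through sigmoid
print(sigmoid(0.0))   # 0.5 -> exactly on the decision boundary
print(sigmoid(4.0))   # ~0.98 -> confidently class 1 (fraud)
print(sigmoid(-4.0))  # ~0.02 -> confidently class 0 (valid)
```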
Before implementing any algorithm we need to arrange our data so we can work with it.
Now, let's use the 'train_test_split' function to split our dataset into random train and test subsets.
from sklearn.model_selection import train_test_split
finalDf_without_isFraud = finalDf2.drop('isFraud', axis=1)
finalDf_only_isFraud = finalDf2.drop(['v1', 'v2', 'v3', 'step'], axis=1)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(finalDf_without_isFraud, finalDf2['isFraud'].values.reshape(-1,1), test_size=0.30,
                                                                    random_state=101, stratify=finalDf2['isFraud'])
Before we move forward, we would like to cover some relevant definitions so the results will be easier to understand later.
Recall measures how many of all positive observations we classified as positive. It tells us how many fraudulent transactions we recalled out of all fraudulent transactions. It is also known as the true positive rate.
Precision measures how many of the observations predicted as positive are in fact positive. In our fraud-detection example, it tells us the ratio of correctly classified fraudulent transactions among all those flagged. It is also known as the positive predictive value.
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.
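To make these definitions concrete, here is a small hand-computed sketch on made-up labels (the counts are illustrative only):

```python
# Toy example: 10 transactions, 4 of them fraud (1 = fraud, 0 = valid)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # one false alarm, one missed fraud

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

recall = tp / (tp + fn)     # 3/4 = 0.75: share of frauds we actually caught
precision = tp / (tp + fp)  # 3/4 = 0.75: share of flagged transactions that were fraud
print(recall, precision)    # 0.75 0.75
```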
Now, let's import LogisticRegression and create our model
from sklearn.linear_model import LogisticRegression
import time
logmodel = LogisticRegression()
reshape_y_train_log = y_train_log.reshape(-1)
start = time.time()
logmodel.fit(X_train_log, reshape_y_train_log)
stop = time.time()
predictions = logmodel.predict(X_test_log)
train_time_of_logistic_regression = stop - start
print(f"Training time: {train_time_of_logistic_regression}s")
Training time: 12.26198148727417s
from sklearn.metrics import classification_report
print(classification_report(y_test_log, predictions))
precision recall f1-score support
0 1.00 1.00 1.00 1903043
1 1.00 1.00 1.00 2315
accuracy 1.00 1905358
macro avg 1.00 1.00 1.00 1905358
weighted avg 1.00 1.00 1.00 1905358
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
print('Accuracy score on test data: ', accuracy_score(y_test_log, predictions))
print('ROC AUC score:', roc_auc_score(y_test_log, predictions))
Accuracy score on test data:  1.0
ROC AUC score: 1.0
As we can see, we got great results... however, on second thought, we used highly imbalanced data: there are 1903043 valid samples and only 2315 frauds in the test set.
So both our train set and test set contain mostly valid samples. Although we used the 'stratify' argument of train_test_split above, which ensures the same ratio of valid to fraud samples is kept in the test set, that isn't enough, because we started with highly imbalanced data, so the ratio is nowhere near 50%.
Therefore we decided to run the logistic regression again, but this time we will oversample before modeling. First, let's explain why.
We have already established that the original dataset is highly imbalanced. This is generally problematic, as a model trained on such data will have difficulty recognising the minority class. This becomes an even bigger problem when we are not interested in just predicting the outcome (if we always assume a transaction is not fraudulent, we will be correct about 99.88% of the time), but in detecting instances of the minority class (e.g. fraud).
There are two common techniques for addressing class imbalance that are often used in practice, and both introduce bias into the dataset in order to equalize the representation of all classes.
Undersampling - undersampling techniques remove observations from the dominant class to reduce the gap between the overrepresented and underrepresented classes. Random undersampling, for example, randomly removes samples (with or without replacement), often until the number of observations in the majority class matches the number in the minority class.
Oversampling - oversampling also aims to reduce the class counts discrepancy, but unlike undersampling it achieves this by increasing the number of instances in the minority class. There are different approaches to this strategy, with the two most commonly used being random oversampling and SMOTE. Random oversampling is a fairly straightforward solution, which simply makes multiple copies of existing minority class observations, thus increasing the number of total observations from the minority class. Synthetic Minority Over-sampling Technique (SMOTE), on the other hand, oversamples the minority class by creating synthetic examples. It has been shown that SMOTE outperforms simple undersampling.
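The random-oversampling variant described above can be sketched on a toy frame (the column values here are made up for illustration):

```python
import pandas as pd

# Toy imbalanced frame: 8 valid rows, 2 fraud rows
df = pd.DataFrame({
    'amount':  [10, 12, 9, 11, 10, 13, 8, 12, 5000, 7000],
    'isFraud': [0,  0,  0, 0,  0,  0,  0, 0,  1,    1],
})

majority = df[df['isFraud'] == 0]
minority = df[df['isFraud'] == 1]

# Random oversampling: resample the minority class with replacement
# until it matches the size of the majority class
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up], ignore_index=True)

print(balanced['isFraud'].value_counts())  # 8 rows of each class
```

SMOTE goes one step further: instead of copying rows, it interpolates new synthetic minority samples between existing ones.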
Because in our case there are only 7717 fraud samples (as described above), if we undersample we will be left with very little data, which is problematic as well.
Therefore, we decided to use the oversampling approach with SMOTE.
First, let's import the SMOTE
from imblearn.over_sampling import SMOTE
Now we'll use SMOTE to oversample. We'll apply it to both the train set and the test set.
X_train_smote, y_train_smote = SMOTE(random_state=888).fit_resample(X_train_log, y_train_log)
x_test_smote, y_test_smote = SMOTE(random_state=567).fit_resample(X_test_log, y_test_log)
#smote_value_counts = y_train_smote["isFraud"].value_counts()
print("Train Fraudulent transactions are %.2f%% of the train set." % (y_train_smote.tolist().count(1) * 100 / len(y_train_smote)))
print("Test Fraudulent transactions are %.2f%% of the test set." % (y_test_smote.tolist().count(1) * 100 / len(y_test_smote)))
Train Fraudulent transactions are 50.00% of the train set.
Test Fraudulent transactions are 50.00% of the test set.
And we can also see below that the data is now balanced.
For the train set:
plt.hist(y_train_smote);
plt.show()
And also for the test set:
plt.hist(y_test_smote);
plt.show()
Now that the data imbalance has been resolved, we can move forward with the actual model training.
logmodel = LogisticRegression()
start = time.time()
logmodel.fit(X_train_smote, y_train_smote)
stop = time.time()
train_time_of_logistic_regression = stop - start
print(f"Training time: {train_time_of_logistic_regression}s")
predictions = logmodel.predict(x_test_smote)
Training time: 27.634199142456055s
As we can see, the training time with the balanced (oversampled) data is higher, which of course makes sense, since we used much more data for this training.
print(classification_report(y_test_smote, predictions))
precision recall f1-score support
0 1.00 1.00 1.00 1903043
1 1.00 1.00 1.00 1903043
accuracy 1.00 3806086
macro avg 1.00 1.00 1.00 3806086
weighted avg 1.00 1.00 1.00 3806086
print('Accuracy score on test data: ', accuracy_score(y_test_smote, predictions))
print('ROC AUC score:', roc_auc_score(y_test_smote, predictions))
Accuracy score on test data:  0.999999474525799
ROC AUC score: 0.999999474525799
As we can see, there is a difference in Accuracy and ROC AUC between the models trained on balanced and imbalanced data; the model trained on the balanced data shows less overfitting and is thus slightly better.
import shap
importance = logmodel.coef_[0]
feat_importances = pd.Series(importance, index=X_train_smote.columns)  # label bars with feature names
feat_importances.nlargest(4).plot(kind='barh', title='Feature Importance')
<AxesSubplot:title={'center':'Feature Importance'}>
#X_train_smote_sample = shap.sample(X_train_smote, 7000)
#explainer = shap.KernelExplainer(logmodel.predict, X_train_smote_sample, keep_index=True)
#shap_values = explainer.shap_values(X_train_smote_sample)
#shap_model = model_linear_regression(pipe=LINEAR_PIPE, inverse=True)
#shap.summary_plot(shap_values, X_train_smote_sample)
The Isolation Forest is used for anomaly detection.
Anomaly detection is the identification of events in a dataset that don't conform to the expected pattern.
Now, what we want to do is utilize an Isolation Forest to detect such anomalies. In our case those anomalies represent the frauds.
The Isolation Forest is an ensemble algorithm from the decision-tree family, similar in principle to Random Forest. Since it uses no pre-defined labels, it is an unsupervised model.
Isolation Forests were built based on the fact that anomalies are the data points that are “few and different”.
In an Isolation Forest, randomly sub-sampled data is processed in a tree structure based on randomly selected features. The samples that travel deeper into the tree are less likely to be anomalies as they required more cuts to isolate them. Similarly, the samples which end up in shorter branches indicate anomalies as it was easier for the tree to separate them from other observations.
The algorithm starts with the training of the data, by generating Isolation Trees.
First, when given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.
Second, branching of the tree starts by selecting a random feature (from the set of all features, in our case the columns), and then a random threshold for that feature.
Then, if the value of a data point is less than the selected threshold, it goes to the left branch, otherwise to the right.
This branching process continues recursively until each data point is completely isolated (the leaves) or the maximum depth is reached.
These steps are then repeated to construct many random binary trees.
Finally, an anomaly score is assigned to each data point based on the depth of the tree required to isolate it; points classified as anomalies are labeled -1, and normal points 1.
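As a sanity check of the mechanism described above, here is a minimal toy sketch using scikit-learn's IsolationForest (synthetic data and default parameters, not the settings used on our dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# A tight cluster of "valid" points around the origin
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=0.1, size=(200, 2))

forest = IsolationForest(random_state=0).fit(X_normal)

# predict() returns 1 for inliers and -1 for anomalies;
# a far-away point is isolated in very few splits, so it is flagged
print(forest.predict([[0.0, 0.0], [10.0, 10.0]]))  # [ 1 -1]
```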
The images above will help you get a better understanding of the algorithm


Before implementing the algorithm we need to arrange our data so we can work with it.
First, we want to separate the valid and fraud transaction samples so we can train the Isolation Forest on the valid transactions.
We also want to get rid of the 'isfraud' column, which tells us whether a sample is a fraud or not.
import numpy as np
valid_train = []
valid_test = []
fraud_train = []
fraud_test = []
new_X_train_smote = X_train_smote.copy()
new_X_train_smote['isfraud'] = y_train_smote.tolist()
valid_train = new_X_train_smote[new_X_train_smote['isfraud'] == 0].drop(['isfraud'], axis=1)
fraud_train = new_X_train_smote[new_X_train_smote['isfraud'] == 1].drop(['isfraud'], axis=1)
new_x_test_smote = x_test_smote.copy()
new_x_test_smote['isfraud'] = y_test_smote.tolist()
valid_test = new_x_test_smote[new_x_test_smote['isfraud'] == 0].drop(['isfraud'], axis=1)
fraud_test = new_x_test_smote[new_x_test_smote['isfraud'] == 1].drop(['isfraud'], axis=1)
fraud = pd.concat([fraud_test, fraud_train], ignore_index=True, sort=False)
Now let's verify that we got this part right, meaning that we've separated the valid and fraud transaction samples and that neither has the 'isfraud' column anymore.
fraud.head()
|   | v1 | v2 | v3 | step |
|---|---|---|---|---|
| 0 | 3.243263 | 0.902683 | 29.488307 | 208 |
| 1 | -0.091486 | 0.537297 | 27.567727 | 387 |
| 2 | 0.068935 | 0.554875 | 27.660118 | 523 |
| 3 | 0.344202 | 0.585035 | 27.818652 | 480 |
| 4 | -0.001396 | 0.572572 | 27.546022 | 511 |
valid_train.head()
|   | v1 | v2 | v3 | step |
|---|---|---|---|---|
| 0 | 0.509961 | 0.646714 | 0.003358 | 162 |
| 1 | 0.188095 | -1.764515 | 0.019631 | 11 |
| 2 | -0.502297 | 0.175290 | 0.012676 | 207 |
| 3 | 0.050355 | 0.422246 | -0.113344 | 155 |
| 4 | -0.538323 | 0.238191 | 0.006157 | 230 |
First, we want to import the following:
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
We also want to import the 'time' library so we can measure the time duration of the training.
import time
model = IsolationForest()
start = time.time()
# we will use the valid_train data to train our model
model.fit(valid_train)
stop = time.time()
train_time_of_isolation_forest = stop - start
print(f"Training time: {train_time_of_isolation_forest}s")
Training time: 9.397419691085815s
Now let's look at the score of our model when given new, unseen valid data samples and also fraud data samples.
Note that for each sample we'll get 1 for an inlier or -1 for an outlier, according to the fitted model.
valid_pred_test = model.predict(valid_test)
fraud_pred = model.predict(fraud)
print("Accuracy in Detecting Valid transactions samples:", list(valid_pred_test).count(1)/valid_pred_test.shape[0])
print("Accuracy in Detecting Fraud transactions samples:", list(fraud_pred).count(-1)/fraud_pred.shape[0])
Accuracy in Detecting Valid transactions samples: 0.8751867403942002
Accuracy in Detecting Fraud transactions samples: 1.0
As we can see from the results above, the model tends to raise false alarms on the valid test set. Our suspicion is that this is caused by the imbalanced 'type' data: types with few samples can influence the numeric values and be considered fraud, because they deviate from the rest. In any case, for credit card transactions we would rather have false alarms than missed detections, so overall the results are good.
import matplotlib.pyplot as plt

labels = data['type'].astype('category').cat.categories.tolist()
counts = data['type'].value_counts()
sizes = [counts[category] for category in labels]
type_fig, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False)  # autopct shows the % on the plot
ax1.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
Now, let's score the data to obtain anomaly scores.
new_X_train_smote.head()
new_x_test_smote.head()
all_over_sampling_data = pd.concat([new_X_train_smote, new_x_test_smote], ignore_index=True, sort=False)
# drop the label so the model only sees the features
data_without_answer = all_over_sampling_data.drop(['isfraud'], axis=1)
data_without_answer['scores'] = model.decision_function(data_without_answer)
# predict on the same feature columns (without the 'scores' column we just added)
data_without_answer['anomaly_score'] = model.predict(all_over_sampling_data.drop(['isfraud'], axis=1))
data_without_answer[data_without_answer['anomaly_score']==-1].head()
|   | v1 | v2 | v3 | step | scores | anomaly_score |
|---|---|---|---|---|---|---|
| 1 | 0.188095 | -1.764515 | 0.019631 | 11 | -0.006815 | -1 |
| 5 | 1.208166 | 0.867705 | -0.255149 | 15 | -0.031883 | -1 |
| 9 | 3.842720 | 1.840759 | -0.564635 | 322 | -0.142147 | -1 |
| 22 | 1.703308 | -5.836487 | 0.059811 | 357 | -0.083680 | -1 |
| 26 | 2.177707 | 1.228721 | -0.367459 | 185 | -0.025172 | -1 |
Above, we can see that the anomalies are assigned an anomaly score of -1, while normal samples are assigned 1.
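The 'scores' and 'anomaly_score' columns are directly related: `predict()` simply thresholds `decision_function()` at zero, so every row with a negative score is labeled -1. A small sketch on synthetic data illustrating this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))  # synthetic stand-in for the feature frame
model = IsolationForest(random_state=0).fit(X)

scores = model.decision_function(X)
preds = model.predict(X)

# predict() is just a threshold on decision_function(): negative score -> -1
reconstructed = np.where(scores < 0, -1, 1)
print((reconstructed == preds).all())  # prints True
```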
Now let's explain the output of our model using SHAP.
import shap
To get an overview of which features are most important for our model we can plot the SHAP values of every feature for every sample.
The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output.
The color represents the feature value (red high, blue low).
# subsample the training set to keep the SHAP computation fast
valid_train_sample = shap.sample(valid_train)
# explain the model's predictions using SHAP
explainer = shap.Explainer(model)
shap_values = explainer(valid_train_sample)
shap.plots.beeswarm(shap_values)
We can also just take the mean absolute value of the SHAP values for each feature to get a standard bar plot (produces stacked bars for multi-class outputs):
shap.plots.bar(shap_values)
Let's try another algorithm, the Local Outlier Factor (LOF), to find the frauds and see if we can get better results.
Local Outlier Factor is an unsupervised outlier detection method.
The anomaly score of each sample is called the Local Outlier Factor.
It measures the local deviation of the density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood.
Let's look at an image that explains the concept of local and global outliers. In the image below, the red sample is a local outlier, while the grey ones are global outliers.

Locality is given by k-nearest neighbors, whose distance is used to estimate the local density.
By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.

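A minimal sketch of this idea on synthetic data (not our transactions): a single point far from a tight cluster has a much lower local density than its neighbours, so LOF flags it as an outlier.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tight cluster plus one isolated point with much lower local density.
rng = np.random.RandomState(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
isolated = np.array([[2.0, 2.0]])
X = np.vstack([cluster, isolated])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier
print(labels[-1])  # the isolated point is flagged as -1
```

Note that `fit_predict` (used here) only scores the training data itself; to score new, unseen samples as we do below, the model must be created with `novelty=True`.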
First, we want to import the following:
from sklearn.neighbors import LocalOutlierFactor
model_lof = LocalOutlierFactor(novelty=True)
start = time.time()
model_lof.fit(valid_train)
stop = time.time()
train_time_of_local_outlier_factor = stop - start
print(f"Training time: {train_time_of_local_outlier_factor}s")
Training time: 190.89177942276s
Now let's look at the score of this model when given new, unseen valid data samples and also fraud data samples.
valid_pred_test = model_lof.predict(valid_test)
fraud_pred = model_lof.predict(fraud)
(scikit-learn warning: X does not have valid feature names, but LocalOutlierFactor was fitted with feature names)
print("Accuracy in Detecting Valid transactions samples:", list(valid_pred_test).count(1)/valid_pred_test.shape[0])
print("Accuracy in Detecting Fraud transactions samples:", list(fraud_pred).count(-1)/fraud_pred.shape[0])
Accuracy in Detecting Valid transactions samples: 0.9678105013917184
Accuracy in Detecting Fraud transactions samples: 1.0
As we can see, both algorithms detected all of the fraud samples, and LOF raised fewer false alarms on the valid test set. All in all, the results are quite good.
Now lets compare the train time of the different models
train_times = [train_time_of_logistic_regression, train_time_of_isolation_forest, train_time_of_local_outlier_factor]
plt.title('Training Time')
plt.barh(range(len(train_times)), train_times)
plt.yticks(range(len(train_times)), ['logistic_regression', 'isolation_forest', 'local_outlier_factor'])
plt.xlabel('Time in seconds')
plt.show()
As we can see, the Isolation Forest trains much faster than LOF, and even faster than Logistic Regression.
However, the accuracy results of LOF were slightly better than those of the Isolation Forest, but it cost us much more training time for this tiny difference in accuracy, so we think the Isolation Forest was the better choice.
from keras.models import Sequential
from keras.layers import Dense
classifier = Sequential()
classifier.add(Dense(40, input_dim=4, activation='relu'))  # only the first layer needs input_dim
classifier.add(Dense(30, activation='relu'))
classifier.add(Dense(20, activation='relu'))
classifier.add(Dense(10, activation='relu'))
classifier.add(Dense(6, activation='relu'))
classifier.add(Dense(4, activation='relu'))
classifier.add(Dense(1, activation='sigmoid'))  # sigmoid output for binary classification
classifier.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train_smote, y_train_smote, epochs=3, batch_size=512)
Epoch 1/3
17346/17346 [==============================] - 132s 7ms/step - loss: 0.6932 - accuracy: 0.5001
Epoch 2/3
17346/17346 [==============================] - 132s 8ms/step - loss: 0.6932 - accuracy: 0.5001
Epoch 3/3
17346/17346 [==============================] - 132s 8ms/step - loss: 0.6932 - accuracy: 0.4999
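The loss stuck at 0.6932 (chance level for binary cross-entropy) suggests the network is not learning at all. One common culprit is unscaled inputs, since the raw 'step' column lives on a very different scale than the other features. A minimal sketch of standardizing the features before `fit`, using a hypothetical stand-in for `X_train_smote`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train_smote: columns on very different scales.
rng = np.random.RandomState(0)
X_train = np.column_stack([
    rng.normal(0, 1, size=(100, 3)),     # v1/v2/v3-like features
    rng.uniform(1, 700, size=(100, 1)),  # raw 'step' column, much larger scale
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# each column now has (approximately) zero mean and unit variance,
# which typically lets gradient-based optimizers make progress
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

In practice the same fitted scaler would also be applied to the test set (`scaler.transform`, never a second `fit`) before evaluating the network.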
Eventually, we saw that the model using the Isolation Forest algorithm gave us better performance than the other approaches.
Although the data was highly imbalanced, the results using Isolation Forest were satisfying.